Model Selection

Zero-shot Video Retrieval

# Zero-shot Video Retrieval

LLaVE-7B is a 7-billion-parameter multimodal embedding model based on LLaVA-OneVision-7B, capable of embedding representations for text, images, multiple images, and videos.

Multimodal Fusion

Transformers English

LLaVE-2B is a 2-billion-parameter multimodal embedding model based on Aquila-VL-2B, featuring a 4K token context window and supporting embeddings for text, images, multiple images, and videos.

Transformers English

LLaVE is a multimodal embedding model based on the LLaVA-OneVision-0.5B model, with a parameter scale of 0.5B, capable of embedding text, images, multiple images, and videos.

Multimodal Fusion

Transformers English

Featured Recommended AI Models

AIbase

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご

© 2025AIbase